LAX: An Efficient Approximate XML Join Based on Clustered Leaf Nodes for XML Data Integration

Authors

Wenxin Liang

Haruo Yokota

Abstract

More and more data are published and exchanged as XML on the Internet. However, different XML data sources may contain the same data in different structures. An efficient method is therefore required to integrate such XML data sources so that users can conveniently access more complete and useful information. The tree edit distance is regarded as an effective metric for evaluating the structural similarity of XML documents. However, its computational cost is extremely high, and the traditional wisdom in join algorithms cannot easily be applied to it. In this paper, we propose LAX (Leaf-clustering based Approximate XML join algorithm), in which the two XML document trees are clustered into subtrees representing independent items, and the similarity between the documents is determined by calculating a similarity degree based on the leaf nodes of each pair of subtrees. We also propose an effective algorithm for clustering the XML documents for LAX. We show that the traditional wisdom in join algorithms is easy to apply to LAX and that the join result contains the complete information of the two documents. We then run experiments that compare LAX with the tree edit distance and evaluate its performance on both synthetic and real data sets. Our experimental results show that LAX is more efficient and more effective than the tree edit distance for measuring the approximate similarity between XML documents.
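The idea of comparing clustered subtrees by their leaf nodes can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract does not define the similarity degree or the clustering step, so this sketch assumes that every direct child of the document root is one independent item and that the similarity degree is the number of shared (tag, text) leaf pairs divided by the larger leaf count. The names `leaf_values`, `similarity_degree`, `approximate_join` and the `threshold` parameter are hypothetical.

```python
import xml.etree.ElementTree as ET


def leaf_values(subtree):
    """Collect (tag, text) pairs for every leaf element under subtree."""
    leaves = set()
    for node in subtree.iter():
        if len(node) == 0:  # no child elements, so this node is a leaf
            leaves.add((node.tag, (node.text or "").strip()))
    return leaves


def similarity_degree(sub_a, sub_b):
    """Assumed measure: shared leaves divided by the larger leaf count."""
    a, b = leaf_values(sub_a), leaf_values(sub_b)
    if not a or not b:
        return 0.0
    return len(a & b) / max(len(a), len(b))


def approximate_join(doc_a, doc_b, threshold=0.5):
    """Pair item subtrees whose similarity degree reaches the threshold.

    Assumption: each direct child of a root is one clustered item.
    A plain nested loop is used here; because items are compared
    pairwise, other join strategies could be substituted.
    """
    pairs = []
    for i, item_a in enumerate(doc_a):
        for j, item_b in enumerate(doc_b):
            score = similarity_degree(item_a, item_b)
            if score >= threshold:
                pairs.append((i, j, score))
    return pairs


# Two documents holding the same book under different structures.
doc_a = ET.fromstring(
    "<lib><book><title>XML Joins</title><year>2005</year></book>"
    "<book><title>Tree Edit</title><year>1989</year></book></lib>"
)
doc_b = ET.fromstring(
    "<catalog><entry><info><title>XML Joins</title></info>"
    "<year>2005</year><price>30</price></entry></catalog>"
)
result = approximate_join(doc_a, doc_b)
```

In this example the first book in `doc_a` matches the single entry in `doc_b` even though the title sits one level deeper there, because only leaf nodes are compared and intermediate structure is ignored.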


Similar resources

A Hybrid Approximate XML Subtree Matching Method Using Syntactic Features and Word Semantics

With the exponential increase in the amount and size of XML data on the Internet, XML subtree matching has become important for many application areas such as change detection, keyword retrieval and knowledge discoveries over XML documents. In our previous work, we have proposed leaf-clustering based approximate XML subtree matching methods using syntax information of both the clustered leaf no...


Towards Cost-based Optimizations of Twig Content-based Queries

In recent years, many approaches to indexing XML data have appeared. These approaches attempt to process XML queries efficiently and sufficient query plans are built for this purpose. Some effort has been expended in the optimization of XML query processing [20]. There are not many works that take cost-based query optimizations into account. In work [20], we find some cost-based considerations,...


Index-Based Approximate XML Joins

XML data integration tools are facing a variety of challenges for their efficient and effective operation. Among these is the requirement to handle a variety of inconsistencies or mistakes present in the data sets. In this paper we study the problem of integrating XML data sources through index assisted join operations, using notions of approximate match in the structure and content of XML docu...


Apply Uncertainty in Document-Oriented Database (MongoDB) Using F-XML

As we move to a big data world where data grows in an unstructured way and at high velocity, a data store is needed to hold this large amount of data. Relational databases have traditionally been used, but they cannot handle data at this scale, so a move to non-relational data stores is needed. In the current study, we have proposed an extension of the Mongo...
